This study investigates the spatial distribution of assault incidents in Toronto in 2023, with a focus on the potential clustering of assaults near Toronto Transit Commission (TTC) subway routes. Using assault occurrence data from the Toronto Police Service and TTC subway route data, we conduct a spatial analysis to determine whether proximity to subway infrastructure is associated with the intensity and clustering of assault incidents.
In this study, we create buffers of different sizes around the subway routes and classify incidents as occurring near or away from subway routes accordingly. In other words, we define the idea of “near TTC subway routes” using different buffer sizes. Preliminary findings indicate that about 43% of assaults occurred within the 1km buffer, with higher densities observed near Line 1 and Line 2 routes compared to Lines 3 and 4. The study employs point pattern analysis and spatial modelling, including tests of complete spatial randomness, kernel density estimation, point process modelling, and cluster detection through HDBSCAN, to evaluate spatial dependence.
The spatial analysis of Toronto assault cases reveals significant clustering patterns, rejecting the null hypothesis of complete spatial randomness (CSR) across all tests. Kolmogorov-Smirnov tests and Ripley’s K-function demonstrate significant clustering of assault cases in Toronto, and the degree of clustering depends on whether the assault occurred within the TTC subway route buffer (i.e. near the TTC subway). Comparing Ripley’s K-function across the within-buffer subsets shows that clustering patterns also depend on the buffer size (i.e. proximity to the TTC subway route). Clustering analysis of the data subsets through HDBSCAN shows that assault incidents near TTC subway routes remain concentrated within the 1-2 km buffers and retain their spatial patterns when the data are expanded to the 5 km buffer and the entire city. Larger datasets revealed additional smaller clusters, suggesting that TTC subway routes may influence the distribution of assault incidents.
Research findings can provide insights for urban planners and policymakers, informing strategies to enhance public safety around transit infrastructure. This research contributes to understanding the interplay between urban crime patterns and public transit systems, with implications for cities beyond Toronto.
Public safety is a central concern in urban planning, particularly in cities like Toronto where public transit systems, such as the Toronto Transit Commission (TTC), are heavily utilized. With millions of commuters relying on the subway system daily, it is crucial to ensure that these transit corridors are safe for all users. Assault cases occurring in and around transit hubs have raised concerns about whether the infrastructure and social environment surrounding subway lines contribute to crime. The subway lines, as major transit arteries, can influence the surrounding environment in various ways. High volumes of pedestrian traffic, densely populated areas, and varied socio-economic conditions around the stations might contribute to spatial patterns of crime that differ from other parts of the city.
This research seeks to investigate whether there is a significant clustering of assault cases occurring near TTC subway routes compared to other areas in Toronto. We aim to explore whether the spatial distribution of assault incidents follows random patterns across Toronto or shows a tendency to concentrate in proximity to TTC subway lines.
The findings provide insights into how transit infrastructure may interact with urban crime patterns and offer recommendations for transit authorities and policymakers to improve safety measures. If significant clustering is detected, the results could influence the allocation of law enforcement resources, guiding increased security measures around certain subway stations. Urban planners could also use the findings to design safer transit environments, incorporating infrastructure and urban designs that discourage crime. Furthermore, this research could have broader implications for understanding the relationship between public transit systems and urban crime, contributing to policies aimed at enhancing public safety not just in Toronto, but in other urban centers with similar transit networks.
The study focuses on spatial analysis of assault incidents in Toronto in 2023, specifically examining their spatial distribution across transit zones (TTC subway buffer zones) to understand the clustering patterns of assaults in relation to transit infrastructure.
We utilize two datasets:
This dataset includes records of assault occurrences in Toronto reported to the Toronto Police Service since 2014. The data is point pattern data, with each observation representing an individual assault incident’s spatial location with coordinates (longitude and latitude) using the WGS84 datum. The full dataset consists of 206956 observations. Aside from the spatial information about the incident, other attributes include the following:
For this study, we focus only on incidents that occurred in 2023. The reduced dataset includes 6,478 distinct locations where assaults occurred in 2023, and in total there are 23,639 individual assault incidents recorded for this year. Note that multiple assaults may have occurred at the same location. To answer our main research question, we only require the spatial location of each point observation, which denotes the occurrence of an assault incident.
This dataset, provided by the City of Toronto, contains geographic shapefiles representing the Toronto Transit Commission (TTC) subway lines. The spatial data is line data, representing each subway line’s path through geographic coordinates (longitude and latitude) using the WGS84 datum. This dataset consists of 4 observations, each representing one subway line within Toronto’s transit system. Aside from the spatial information about the subway line, other attributes include the following:
For this study, we consider all 4 observations in our study domain since we are interested in understanding incident patterns near all subway lines.
Below is a map visualizing the assault incidents that occurred in Toronto in 2023 and the 4 TTC subway lines.
Figure 1: Map visualizing the assault incidents that occurred in Toronto in 2023 and the 4 TTC subway lines. Each red point represents a single assault incident. The yellow line represents the Line 1 subway route, the green line the Line 2 subway route, the blue line the Line 3 subway route, and the purple line the Line 4 subway route. The grey polygon is the observation window covering the Toronto region.
We first produce 1km, 2km and 5km spatial buffers along the TTC subway routes. Regions inside a buffer are considered near TTC subway routes, and regions outside the buffer are considered not near TTC subway routes.
Then, we perform a spatial join to merge the assaults dataset with the buffer dataset by geometric intersection. Hence, assault incidents that occurred within a given buffer area are labelled “assault occurred near TTC subway area”, and assault incidents located outside that buffer area are labelled “assault occurred outside the TTC subway area”.
In other words, we create 3 additional binary covariates, each representing whether the assault incident occurred within the 1km, 2km and 5km spatial buffers respectively.
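The buffer classification described above can be illustrated with a minimal sketch. It assumes planar coordinates in kilometres (WGS84 longitude/latitude would first need to be projected), and the routes and incident location are hypothetical rather than real TTC geometry; it is not the spatial join used in the study, only an equivalent distance-based check.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_route(p, route):
    """Shortest distance from p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(route, route[1:]))

def buffer_flags(p, routes, radii_km=(1.0, 2.0, 5.0)):
    """Return {radius: 0/1} flags marking whether p lies within each buffer."""
    d = min(distance_to_route(p, r) for r in routes)
    return {radius: int(d <= radius) for radius in radii_km}

# Hypothetical routes and incident in kilometres (not real TTC geometry)
routes = [[(0.0, 0.0), (10.0, 0.0)], [(5.0, 0.0), (5.0, 8.0)]]
flags = buffer_flags((3.0, 1.5), routes)
print(flags)
```

A point lies inside a buffer of radius r exactly when its distance to the nearest route is at most r, so the three flags are nested: any incident inside the 1km buffer is also inside the 2km and 5km buffers.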
In total, there were 23639 individual assault incidents recorded in Toronto in 2023. Of these, 10162 occurred within the 1km TTC subway buffer, i.e. near TTC subway routes, accounting for around 42.99% of all assault incidents. The remaining 13477 assault incidents occurred outside the 1km buffer, accounting for around 57.01% of the total. Hence, assault incidents are split fairly evenly between locations near TTC subway routes and locations away from them. The table below shows the proportion of assault incidents that occurred within each subway route buffer.
| Buffer size | Number of assaults within buffer | Proportion of assaults within buffer (%) |
|---|---|---|
| 1km | 10162 | 42.99 |
| 2km | 14648 | 61.97 |
| 5km | 21618 | 91.45 |
Table 1: Summary of assault cases that occurred within the TTC subway route buffers. This table gives the number and proportion of assault cases that occurred within buffers of different radii. As expected, the number and proportion of assault cases increase as the buffer size increases, with the 5km buffer covering over 90% of cases.
| Subway route name | Length | Area (1km buffer) | Area (2km buffer) | Area (5km buffer) |
|---|---|---|---|---|
| LINE 1 (YONGE-UNIVERSITY) | 38.89 [km] | 99.30 [km^2] | 189.44 [km^2] | 458.27 [km^2] |
| LINE 2 (BLOOR - DANFORTH) | 26.19 [km] | 73.43 [km^2] | 150.41 [km^2] | 429.16 [km^2] |
| LINE 3 (SCARBOROUGH) | 6.62 [km] | 20.24 [km^2] | 47.93 [km^2] | 175.20 [km^2] |
| LINE 4 (SHEPPARD) | 5.37 [km] | 17.56 [km^2] | 43.16 [km^2] | 166.54 [km^2] |
Table 2: TTC subway route buffer measurements. This table lists the length and buffer area of each of the 4 TTC subway lines.
Line 1 and Line 2 are substantially longer than Line 3 and Line 4. Accordingly, the Line 1 and Line 2 buffers are also substantially larger in area.
| Subway route name | Number of Assaults (1km buffer) | Proportion % (1km buffer) | Number of Assaults (2km buffer) | Proportion % (2km buffer) | Number of Assaults (5km buffer) | Proportion % (5km buffer) |
|---|---|---|---|---|---|---|
| LINE 1 (YONGE-UNIVERSITY) | 6385 | 27.01 | 8704 | 36.82 | 13737 | 58.11 |
| LINE 2 (BLOOR - DANFORTH) | 4452 | 18.83 | 8505 | 35.98 | 15777 | 66.74 |
| LINE 3 (SCARBOROUGH) | 741 | 3.13 | 1543 | 6.53 | 4354 | 18.42 |
| LINE 4 (SHEPPARD) | 558 | 2.36 | 900 | 3.81 | 2786 | 11.79 |
Table 3: Summary statistics of assault cases by TTC subway route. This table shows the number and proportion of assault cases that occurred within each subway line’s buffer. Each number/proportion column gives the number/proportion of assault cases for the buffer size indicated in parentheses. Note: The numbers of assaults summed across all subway lines for a given buffer size do not necessarily add up to the totals in Table 1. This is because an assault incident may fall into more than one line’s buffer zone, in which case it is counted more than once in this table.
Overall, in the 1km and 2km buffers, regions near Line 1 have the highest number and proportion of assault incidents, followed by regions near Line 2. In the 5km buffer the order reverses, with Line 2 (15777 incidents) ahead of Line 1 (13737 incidents). Regions near Line 3 and Line 4 have substantially lower proportions of assault incidents at every buffer size. As expected, for every subway line, the number and proportion of assault cases increase as the buffer size increases.
| Subway route name | Density (1km buffer) | Density (2km buffer) | Density (5km buffer) |
|---|---|---|---|
| LINE 1 (YONGE-UNIVERSITY) | 64.30 [1/km^2] | 45.95 [1/km^2] | 29.98 [1/km^2] |
| LINE 2 (BLOOR - DANFORTH) | 60.63 [1/km^2] | 56.55 [1/km^2] | 36.76 [1/km^2] |
| LINE 3 (SCARBOROUGH) | 36.61 [1/km^2] | 32.19 [1/km^2] | 24.85 [1/km^2] |
| LINE 4 (SHEPPARD) | 31.78 [1/km^2] | 20.85 [1/km^2] | 16.73 [1/km^2] |
Table 4: Assault case density by subway route. This table shows the density of assault cases within each subway line’s buffer. Each density column gives the densities for the buffer size indicated in parentheses.
In the 1km buffer, the Line 1 and Line 2 subway route buffers have similar assault incident densities. The Line 3 and Line 4 buffers also have similar densities to each other, but these are only a little over half of those observed for the Line 1 and Line 2 buffers. In the 2km and 5km buffers, the Line 1 and Line 2 buffers again have higher assault incident densities than the Line 3 and Line 4 buffers, although the differences are smaller than in the 1km buffer. Overall, for every subway line, the assault incident density decreases as the buffer size increases. This suggests that fewer assault cases occur per unit area at locations farther from the TTC subway lines.
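The densities in Table 4 are the counts in Table 3 divided by the buffer areas in Table 2. As a quick check, the short sketch below recomputes the 1km column from those two tables; the dictionary names are ours and purely illustrative.

```python
# Counts (Table 3) and buffer areas in km^2 (Table 2) for the 1km buffer
counts_1km = {"LINE 1": 6385, "LINE 2": 4452, "LINE 3": 741, "LINE 4": 558}
areas_1km = {"LINE 1": 99.30, "LINE 2": 73.43, "LINE 3": 20.24, "LINE 4": 17.56}

# Density = number of assaults per square kilometre of buffer
density_1km = {line: counts_1km[line] / areas_1km[line] for line in counts_1km}
for line, d in density_1km.items():
    print(f"{line}: {d:.2f} assaults per km^2")
```

Because the per-line buffers overlap, these densities describe each line’s buffer separately and should not be summed across lines.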
To study whether assault cases cluster near TTC subway routes, we conduct a point pattern analysis. We have created buffers around the TTC subway routes with radii of 1 km, 2 km, and 5 km (the last of which covers over 90% of data points, as shown in Table 1).
Recall that we created 3 additional binary covariates, each representing whether the assault incident occurred within the 1km, 2km and 5km TTC subway buffers respectively. To get a rough understanding of how proximity to TTC subway routes relates to the point process, we conduct the Kolmogorov-Smirnov test, a non-parametric statistical test used to compare distributions. In our analysis, we use this test to evaluate whether the spatial distribution of assault cases is influenced by proximity to the TTC lines by comparing the observed distribution of assault cases to the distribution expected under CSR with respect to the 1 km/ 2km/ 5km buffer covariate. Since we have 3 “within buffer” variables, we conducted 3 tests, each with the null hypothesis \(H_0\) that the spatial distribution of assault cases is independent of the corresponding buffer covariate.
If a p-value is significant, we may reject \(H_0\), suggesting that assault cases are spatially dependent on the covariate and indicating potential clustering or dispersion near the buffer. This approach helps determine whether assault cases are randomly distributed or exhibit a spatial pattern related to the buffers, providing insight into potential clustering near the transit lines.
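For intuition, the test statistic takes a simple form when the covariate is binary. The sketch below assumes a 0/1 “within buffer” covariate, for which the two cumulative distributions differ only in the share of mass at 0, so the KS distance reduces to the gap between the share of points outside the buffer and the share of area outside it. The function name and inputs are hypothetical, and this is not necessarily how the statistical software computes the statistic.

```python
def ks_statistic_binary(n_inside, n_total, area_inside, area_total):
    """KS distance between observed and CSR-expected CDFs of a 0/1 covariate."""
    # Empirical P(Z = 0): share of points falling outside the buffer
    observed_outside = (n_total - n_inside) / n_total
    # P(Z = 0) under CSR: share of the window's area lying outside the buffer
    expected_outside = (area_total - area_inside) / area_total
    return abs(observed_outside - expected_outside)

# Hypothetical inputs: 400 of 1000 points fall in a buffer covering 25% of the window
d = ks_statistic_binary(n_inside=400, n_total=1000, area_inside=25.0, area_total=100.0)
print(round(d, 3))
```

A large value of the statistic means that the buffer holds a noticeably different share of incidents than its share of the study area would predict under CSR.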
We wish to test if all assault incidents are uniformly distributed across the entire Toronto region and are independent of each other, i.e. complete spatial randomness (CSR). CSR means an event (an assault incident) is equally likely to occur at any location or region within the domain (Toronto).
Quadrat counting
Firstly, we employ quadrat counting to visualize how the intensity of assault incidents varies across Toronto: we overlay a grid of cells (called quadrats) and count the number of assault incidents in each cell. In general, if the point pattern follows CSR, we expect the counts to vary only by chance across quadrats; if the point pattern is clustered, we expect some quadrats to have a significantly higher number of points; if the point pattern is regular, we expect all quadrats to have a similar number of points.
Additionally, we use the R function quadrat.test() to test for CSR, clustering (points are concentrated in some quadrats), or regularity (points are evenly spaced across quadrats). In other words, we conduct 3 quadrat.test() runs, each specifying a different alternative hypothesis: (1) the point pattern deviates from CSR (two-sided), (2) the point pattern is regular, and (3) the point pattern is clustered.
Since we conduct multiple tests (3 tests), we adjust the p-value threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error. For each test, if the p-value obtained is lower than the threshold, we can reject the null hypothesis. In this study specifically, if a clustering pattern is in fact present, we expect to obtain significant p-values for Test 1 and Test 3 and an insignificant p-value for Test 2.
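To make the procedure concrete, here is a minimal sketch of a conditional Monte Carlo quadrat test against the clustered alternative. It assumes a unit-square window, a 4-by-4 grid and a hypothetical, strongly clustered pattern; it is a simplified stand-in for quadrat.test(), not a reproduction of it.

```python
import random

def quadrat_counts(points, nx, ny):
    """Count points of the unit square in an nx-by-ny grid of quadrats."""
    counts = [0] * (nx * ny)
    for x, y in points:
        col = min(int(x * nx), nx - 1)  # clamp so x == 1.0 falls in the last column
        row = min(int(y * ny), ny - 1)
        counts[row * nx + col] += 1
    return counts

def chi_square(counts):
    """Pearson statistic against equal expected counts (the CSR expectation)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def monte_carlo_p_clustered(points, nx=4, ny=4, nsim=199, seed=1):
    """Upper-tail Monte Carlo p-value: large statistics indicate clustering."""
    rng = random.Random(seed)
    observed = chi_square(quadrat_counts(points, nx, ny))
    n = len(points)  # simulations keep the number of points fixed (conditional test)
    exceed = 0
    for _ in range(nsim):
        sim = [(rng.random(), rng.random()) for _ in range(n)]
        if chi_square(quadrat_counts(sim, nx, ny)) >= observed:
            exceed += 1
    return (1 + exceed) / (nsim + 1)

# Hypothetical, strongly clustered pattern: 100 points packed into one corner
rng = random.Random(0)
clustered = [(0.2 * rng.random(), 0.2 * rng.random()) for _ in range(100)]
p_value = monte_carlo_p_clustered(clustered)
alpha = 0.05 / 3  # Bonferroni-adjusted threshold for three tests
print(p_value, p_value < alpha)
```

With 199 simulations the smallest attainable p-value is 1/200, which is why the number of simulations bounds how small a Monte Carlo p-value can be reported.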
Ripley’s K function
To test for CSR (i.e. number of points are random across all quadrats), clustered (i.e. points are concentrated in some quadrats), or regular (i.e. all quadrats have similar number of points), we additionally calculate the Ripley’s K-function (with L function adjustment) on the full dataset without setting any buffer restriction.
Ripley’s K function is \(K(r) = \lambda^{-1}E(N_0(r))\), where \(N_0(r)\) is the number of events within distance \(r\) of an arbitrary event (excluding that event itself) and \(\lambda\) is the average number of events per unit area. Under the null hypothesis of CSR, \(K(r) = \pi r^2\). The test asks whether the observed number of points within a given distance of any point in the dataset differs significantly from what would be expected under CSR. The L function is a transformation of the K function that makes interpretation easier. Specifically, \(L(r) = \sqrt{K(r) / \pi} - r\), so the expected value for a random pattern equals 0 at all distances. This linearizes the K function, making it easier to compare the observed pattern with a random one. In general, positive values of \(L(r)\) suggest clustering, while negative values suggest regularity. Note that we apply an edge (boundary) correction in all Ripley’s K function estimations.
Ripley’s K function is applied to the complete dataset to test for complete spatial randomness (CSR): ruling out CSR would confirm the presence of spatial dependence and, if the L function produces a curve well above 0, provide a basis for further clustering analysis of assault cases near the TTC lines.
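The definitions above can be sketched in a few lines. The example below is a naive estimator that omits the edge correction used in the study, and the points and window area are hypothetical; it only illustrates how pair counts translate into \(K(r)\) and \(L(r)\).

```python
import math

def ripley_k_naive(points, r, area):
    """Naive estimate of K(r): no edge correction, O(n^2) pair scan."""
    n = len(points)
    pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                pairs += 1  # ordered pairs, so each close pair counts twice
    return area * pairs / (n * (n - 1))

def l_centered(k_value, r):
    """L(r) = sqrt(K(r) / pi) - r, which is 0 under CSR."""
    return math.sqrt(k_value / math.pi) - r

# Hypothetical pattern in a unit-area window: three close points and one far away
pts = [(0.10, 0.10), (0.12, 0.10), (0.11, 0.12), (0.80, 0.80)]
k = ripley_k_naive(pts, r=0.05, area=1.0)
print(k, l_centered(k, 0.05))
```

Without edge correction, points near the window boundary have unobserved neighbours outside it, so the naive estimate is biased downward at larger distances; this is the reason for the boundary correction mentioned above.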
G-function
The G function is the cumulative distribution of the distances between nearest neighbors. The observed G function \(\hat{G}(r)\) is the proportion of observed points whose nearest neighbor is closer than \(r\). Under CSR, \(G(r) = 1- e^{-\lambda \pi r^2}\). The G test examines the distribution of nearest-neighbor distances: it tests whether the observed distances differ significantly from what would be expected under CSR by comparing the cumulative distribution function of nearest-neighbor distances in the observed data to that expected under CSR. If \(\hat{G}(r)\) is much greater than \(G(r)\), there is clustering, whereas if it is smaller, there is regularity. Note that we also apply an edge (boundary) correction in the G function estimation.
The main difference between K function and G function is that: K function measures the number of events found up to a given distance of any particular event (i.e. uses pairwise distances) and tests on differences in terms of number of points, while G function measures the distribution of distances from an arbitrary event to its nearest event (i.e. uses nearest neighbor distances) and tests on differences in terms of distribution of nearest neighbors between points.
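The nearest-neighbour view can be illustrated with a minimal sketch that compares the empirical \(\hat{G}(r)\) with its CSR counterpart. It omits edge correction, and the points, window size and intensity are hypothetical.

```python
import math

def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point."""
    return [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]

def g_hat(points, r):
    """Empirical G(r): share of points whose nearest neighbour is within r."""
    nn = nearest_neighbour_distances(points)
    return sum(d <= r for d in nn) / len(nn)

def g_csr(r, intensity):
    """Theoretical G(r) = 1 - exp(-lambda * pi * r^2) under CSR."""
    return 1.0 - math.exp(-intensity * math.pi * r * r)

# Hypothetical pattern: 3 points in a 10 x 10 window (intensity 0.03)
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(g_hat(pts, 1.0), g_csr(1.0, intensity=3 / 100))
```

In this toy pattern two of the three points have a neighbour within distance 1, far more than CSR at the same intensity would predict, which is the signature of clustering described above.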
Next, we subset the data points according to the 3 “within buffer” variables and obtain 3 sets of data, representing the assault cases that occurred within the 1km/ 2km/ 5km TTC subway buffers respectively. For each subset, we refit Ripley’s K function with the L function adjustment. We then conduct pairwise Kolmogorov-Smirnov tests on the Ripley’s K function curves obtained from the different buffer sizes to quantify the differences in clustering patterns across buffer sizes. Once again, since we are conducting multiple tests (3 tests), we adjust the p-value threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error. For each pairwise test, if the p-value obtained is lower than the threshold, we reject the null hypothesis that the Ripley’s K functions of the data points within the two buffer zones being compared are the same. This helps us understand whether the degree of clustering differs across buffer sizes. Note that we apply an edge (boundary) correction in all Ripley’s K function estimations.
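As an illustration of the pairwise comparison, the sketch below computes the two-sample Kolmogorov-Smirnov statistic \(D\), the largest gap between two empirical distribution functions, and the Bonferroni-adjusted threshold. The curve values are hypothetical and the p-value calculation is omitted.

```python
def ks_two_sample(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(x <= v for x in a) / len(a)
        cdf_b = sum(x <= v for x in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

alpha = 0.05 / 3  # Bonferroni threshold for the three pairwise comparisons

# Hypothetical function values sampled on a common grid of distances
l_1km = [0.0, 40.0, 75.0, 90.0, 95.0]
l_5km = [0.0, 10.0, 25.0, 45.0, 60.0]
print(ks_two_sample(l_1km, l_5km), round(alpha, 6))
```

The statistic ranges from 0 (identical samples) to 1 (no overlap), so larger values of \(D\) indicate curves whose values are distributed more differently.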
We use kernel density estimation (KDE) to estimate the intensity function non-parametrically through kernel smoothing. The non-parametric form is \(\hat{\lambda}(s) = \frac{1}{h^2}\sum_i K\left(\frac{\|s-s_i\|}{h}\right) / q(\|s\|)\), where \(K(\cdot)\) is a kernel function, \(q(\|s\|)\) is a boundary correction, and \(\|s-s_i\|\) is the distance between location \(s\) and observed point \(s_i\). In this analysis, we use an isotropic Gaussian kernel, written with the bandwidth absorbed into the kernel as \(K(s) = \frac{1}{\sqrt{2\pi h^2}}\exp\left(-\frac{s^2}{2h^2}\right)\), where \(s\) is the distance from the point where the density is being estimated and \(h\) is the bandwidth that controls the degree of smoothing. Hence, the density function is estimated by \(\hat{f}(s) = \frac{1}{n}\sum_i K(s-s_i) = \frac{1}{n}\sum_i\frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{(s-s_i)^2}{2h^2}\right)\). Each data point \(s_i\) contributes a Gaussian-shaped bump to the density estimate, centered at \(s_i\) and with spread controlled by \(h\). The estimated density \(\hat{f}(s)\) at \(s\) is the average of these contributions.
Since we do not know the kernel bandwidth, we estimate an optimal value using cross-validation by bw.diggle(). This function uses a leave-one-out cross-validation (LOOCV) criterion to estimate the optimal bandwidth; that is, for each point in the spatial point pattern, it estimates the density at that point using the kernel density estimator, excluding the contribution of the point itself.
We estimate and plot the density on full dataset and the 3 “within buffer” data subsets to get varying density estimates within the buffers and visualize the differences across different buffer sizes.
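The kernel smoothing step can be sketched as follows. This example assumes the two-dimensional normalising constant \(1/(2\pi h^2)\) for an isotropic Gaussian in the plane, omits the boundary correction, and uses hypothetical points and a hypothetical fixed bandwidth rather than one selected by cross-validation.

```python
import math

def gaussian_intensity(s, points, h):
    """Kernel intensity estimate at s: sum of isotropic 2D Gaussian bumps."""
    sx, sy = s
    norm = 1.0 / (2.0 * math.pi * h * h)  # assumed 2D Gaussian normalising constant
    total = 0.0
    for x, y in points:
        dist_sq = (sx - x) ** 2 + (sy - y) ** 2
        total += norm * math.exp(-dist_sq / (2.0 * h * h))
    return total  # points per unit area; divide by len(points) for a density

# Hypothetical incident locations in kilometres and a hypothetical bandwidth
pts = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0)]
near = gaussian_intensity((0.1, 0.0), pts, h=0.5)
far = gaussian_intensity((1.5, 1.5), pts, h=0.5)
print(near > far)
```

A smaller bandwidth makes each bump narrower and the surface spikier, while a larger bandwidth smooths local hotspots away, which is why the bandwidth is chosen by cross-validation rather than fixed by hand.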
We fit several Poisson process models on the full dataset using the 3 “within buffer” variables. Poisson process models assume that assault cases occur independently of one another, whereas the Ripley’s K function analysis provides potential evidence of clustering. We nevertheless fit Poisson process models because they provide a simple baseline for the distribution of assault incidents and a reference point for quantifying how much the actual data deviate from randomness, even if clustering is present. In clustered data, subsets of the data might still adhere to Poisson behavior, so fitting the Poisson process locally, or accounting in the model for how these subsets differ (i.e. using the 3 “within buffer” variables), can still yield useful information.
The first model we fit is a homogeneous Poisson process model, which assumes that the intensity is constant over the region and uses no “within buffer” variable. We then fit 3 inhomogeneous Poisson process models (which still assume that points occur independently of one another), in which the intensity is not constant over the region but varies spatially and depends only on the corresponding “within buffer” variable. In this case, the intensity function is defined as a log-linear model: \[ \lambda(x, y) = \exp(\beta_0 + \beta_1 \text{"within x km buffer"}), \] where \(\beta_0\) is the intercept (the log-intensity outside the buffer) and \(\beta_1\) is the coefficient of the binary “within x km buffer” covariate.
After fitting the models, we obtain parameter estimates and their confidence intervals for each model. In particular, for the inhomogeneous models with a “within buffer” variable as covariate, we use the Z-test result to determine whether the covariate makes a significant difference to the intensity. If these “within buffer” variables are found to be significant, we may conclude that the intensity does vary with the buffer, i.e. with whether (and how near) assault incidents occurred to TTC subway lines.
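With a single binary covariate, the log-linear Poisson model has a closed-form maximum likelihood solution, which the sketch below illustrates: the fitted intensity inside and outside the buffer is simply the count divided by the area of each region. The counts are the 1km figures from Table 1, but both areas are hypothetical, and fitting software that approximates the likelihood numerically may return slightly different values.

```python
import math

def fit_binary_poisson(n_in, n_out, area_in, area_out):
    """Closed-form MLE of lambda = exp(b0 + b1 * inside) for a 0/1 covariate."""
    b0 = math.log(n_out / area_out)              # log-intensity outside the buffer
    b1 = math.log(n_in / area_in) - b0           # log intensity ratio, inside vs outside
    se_b1 = math.sqrt(1.0 / n_in + 1.0 / n_out)  # asymptotic standard error of b1
    return b0, b1, b1 / se_b1                    # last value is the Wald z statistic

# Counts are the 1km figures from Table 1; both areas (km^2) are hypothetical
b0, b1, z = fit_binary_poisson(n_in=10162, n_out=13477, area_in=150.0, area_out=480.0)
print(round(b0, 3), round(b1, 3), round(z, 1))
```

Here \(\exp(\beta_1)\) is the ratio of the intensity inside the buffer to the intensity outside it, so a significantly positive \(\beta_1\) means more assaults per unit area near the subway lines than away from them.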
Lastly, we apply Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), using the hdbscan() function, to examine the clustering of assault incidents. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points lying in dense parts of a region. It uses two main parameters:
- eps: The maximum radius of a neighborhood for a point to be considered as part of a cluster.
- minPts: The minimum number of points required to form a cluster.

HDBSCAN, on the other hand, does not require an \(\epsilon\) neighbourhood but still requires the minimum number of points that we wish to have in a cluster. It uses the concept of mutual reachability, where we look at distances that connect all the points.
In this analysis, we specify the minimum number of points required to form a cluster as 100.
We run HDBSCAN on the full dataset and on the 3 “within buffer” subsets of data points. For each set of clusters, we visualize the clusters on a plot and compare the plots to see whether there are notable differences in the location, shape and size of the clusters formed, especially those formed in the TTC buffer zones. Pronounced clustering patterns within or around the TTC subway route buffers could suggest non-random spatial dependence, implying that TTC subway routes may influence the spatial distribution of assault cases. Note that we expect the number, and potentially the size, of clusters to increase as the number of points in the data increases. We are particularly interested in whether the clusters formed on a smaller buffer subset disappear as the data expand to the full dataset. For instance, clusters found using the 1km buffer subset must lie within the 1km buffer region, and we check whether these clusters disappear in the 2km buffer data, the 5km buffer data or the full dataset.
Ideally, if clustering does occur near TTC subway zones, we should obtain similar clusters (in terms of location) from the smaller buffer subsets (i.e. 1km and 2km). As the buffer size increases (i.e. to 5km or the full dataset), these clusters should be retained (perhaps with increased size), and any additional clusters occurring outside the 1km/ 2km buffer zones should be substantially smaller.
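The mutual reachability idea behind HDBSCAN can be sketched briefly. The example assumes that a point’s core distance is the distance to its k-th nearest other point (implementations differ on whether the point itself is counted), and the points are hypothetical; it shows only the distance transformation, not the full clustering procedure.

```python
import math

def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores

def mutual_reachability(points, k):
    """Matrix of max(core(a), core(b), d(a, b)) for every pair of points."""
    core = core_distances(points, k)
    n = len(points)
    return [
        [0.0 if a == b else max(core[a], core[b], math.dist(points[a], points[b]))
         for b in range(n)]
        for a in range(n)
    ]

# Hypothetical points: three close together and one isolated
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
mreach = mutual_reachability(pts, k=2)
print(mreach[0][1], mreach[2][3])
```

Points in sparse areas have large core distances, so their mutual reachability distance to every other point is inflated; this pushes isolated incidents away from dense groups and lets them be treated as noise.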
The results of the Spatial Kolmogorov-Smirnov (KS) tests indicate a significant deviation from complete spatial randomness (CSR) in the distribution of assaults in Toronto when evaluated against covariates at different spatial scales (buffers of 1 km, 2 km, and 5 km). For the 1 km buffer, the test statistic \(D = 0.39218\) with a p-value < \(2.2 \times 10^{-16}\) suggests a strong departure from uniformity under CSR. Similarly, for the 2 km buffer, \(D = 0.38294\) with a p-value < \(2.2 \times 10^{-16}\) confirms this pattern. At the 5 km scale, the deviation is even more pronounced, with \(D = 0.49764\) and a p-value < \(2.2 \times 10^{-16}\). These results consistently reject the null hypothesis that the spatial distribution of assault cases is independent of the buffer covariate ( 1km/ 2km/ 5km “within buffer” variable). This suggests that assault cases are spatially dependent on the covariate, indicating potential clustering or regularity near the buffer.
Figure 2: Quadrat Counting plot of Toronto assault cases. This Quadrat Counting plot illustrates the spatial distribution of assault cases in Toronto. Each cell in the grid represents the count of incidents within that area, revealing a clear pattern of clustering. Higher counts are concentrated in the central region, while peripheral areas show significantly lower or zero cases.
The Quadrat Counting plot (Figure 2) reveals that the distribution of assault cases is highly uneven, with distinct clusters of higher counts in the south, north-west and east regions, indicating areas of concentrated criminal activity. Note that those areas are consistent with the area covered by the Line 1 and Line 2 TTC subway routes. This pattern highlights spatial heterogeneity in assault occurrences, possibly associated with underlying urban factors such as the TTC subway lines.
The results of the Conditional Monte Carlo tests using quadrat counts indicate significant deviations from complete spatial randomness (CSR) and regular pattern.
| Null Hypothesis | Alternative Hypothesis | p-value | Significance |
|---|---|---|---|
| The point pattern follows CSR | The point pattern deviates from CSR. | 0.001 | Significant (p < 0.0167) |
| The point pattern follows CSR or clustering pattern | The point pattern follows a regular pattern | 1 | Not Significant (p ≥ 0.0167) |
| The point pattern follows CSR or regular pattern | The point pattern follows a clustering pattern | 0.0005 | Significant (p < 0.0167) |
Table 5: Quadrat tests setting and results. This table outlines the null hypothesis, alternative hypothesis, p-value and whether significant result is found for each test.
As mentioned earlier, since we are conducting multiple tests (3 tests) here, we adjusted the p-value threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error.
For the first test, with the two-sided alternative hypothesis, the p-value of 0.001 leads us to reject the null hypothesis of CSR, suggesting non-randomness in the distribution. The second test, with the regularity alternative hypothesis, resulted in a p-value of 1: there is insufficient evidence to reject the hypothesis that the point pattern follows CSR or a clustering pattern, and hence no evidence of a regular spatial pattern. Conversely, the third test, with the clustering alternative hypothesis, resulted in a p-value of 0.0005, strongly supporting the presence of a clustered spatial pattern. These results indicate that the observed data exhibit significant clustering rather than randomness or regularity.
Figure 3: L-function plot for Toronto assault cases, with CSR envelopes. The black line indicates the observed \(L(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(L(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.
The L-function plot (Figure 3) demonstrates strong evidence of spatial clustering in Toronto assault cases. The observed \(L(r)\) values lie well above the CSR envelopes at all distances \(r\) (in metres), showing that the observed distribution deviates significantly from complete spatial randomness. This pattern suggests that assaults are not uniformly distributed but instead tend to occur in clusters. This finding is consistent with the quadrat count test results found in section 4.22.
Figure 4: G-function plot for Toronto assault cases, with CSR envelopes. The black line indicates the observed \(G(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(G(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.
The observed G-function (Figure 4) is significantly greater than theoretical CSR \(G(r)\) and generally remains outside the CSR envelopes, indicating that the spatial pattern of assaults does deviate significantly from randomness at the analyzed distances. It also suggests strong evidence of clustering in the distribution of assault cases. This finding is consistent with the test results found in section 4.22 Quadrat Test and section 4.23 K-function.
Figure 5: L-function plot for Toronto assault cases that occurred within the 1km subway buffer, with CSR envelopes. The black line indicates the observed \(L(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(L(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.
Figure 6: L-function plot for Toronto assault cases that occurred within the 2km subway buffer, with CSR envelopes. The black line indicates the observed \(L(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(L(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.
Figure 7: L-function plot for Toronto assault cases that occurred within the 5km subway buffer, with CSR envelopes. The black line indicates the observed \(L(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(L(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.
According to Figures 5-7, the L-function plots demonstrate strong evidence of spatial clustering in Toronto assault cases, regardless of which buffer subset is used. Across all buffer sizes, the observed \(L(r)\) values significantly exceed the values expected under complete spatial randomness (grey shaded region), indicating clustering rather than randomness. The clustering effect is most pronounced in the 1 km buffer, where the curve rises rapidly and peaks earlier than for the other buffers. The 2 km buffer plot shows a similar pattern but with slightly reduced clustering intensity and a broader peak. The 5 km buffer exhibits the least steep increase, indicating that a comparable degree of clustering is reached only at larger distances. These differences suggest that the observed clustering becomes less concentrated as the buffer size increases, which might reflect varying scales of spatial dependence or density across the study area.
We use the Kolmogorov-Smirnov test to compare the Ripley’s K-function curves obtained from the different buffer subsets, and so to quantify how the clustering patterns differ across buffer sizes. As mentioned earlier, because we conduct multiple tests, we adjust the p-value threshold to \(0.05/3 \approx 0.0167\) to prevent an inflated Type I error rate. The results are summarized below:
| Buffer Comparison | Test Statistic (D) | p-value | Significance |
|---|---|---|---|
| 1 km vs. 2 km | 0.68031 | < 2.2 × 10\(^{-16}\) | Highly Significant |
| 2 km vs. 5 km | 0.28265 | < 2.2 × 10\(^{-16}\) | Highly Significant |
| 1 km vs. 5 km | 0.53996 | < 2.2 × 10\(^{-16}\) | Highly Significant |
Table 6: Kolmogorov-Smirnov test results. This table lists each pairwise comparison of function curves, the test statistic, the p-value, and whether the result is significant.
The results confirm significant differences in clustering patterns between all buffer sizes, with the largest difference observed between the 1 km and 2 km buffers (D = 0.680), followed by the 1 km and 5 km buffers (D = 0.540). These findings suggest that spatial clustering patterns may vary substantially with buffer size, which has implications for the HDBSCAN clustering analysis. Specifically, the choice of buffer subset may strongly influence the resulting cluster structure and the scale at which clusters are identified.
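The test statistic D and the adjusted threshold can be sketched as follows. This is a minimal illustration: the two short lists of \(K(r)\) values are hypothetical, and the helper `ks_statistic` is not the routine used in this study. D is the largest vertical gap between the two empirical distribution functions, and the Bonferroni threshold divides the significance level by the number of comparisons.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Bonferroni adjustment for the three pairwise comparisons in Table 6.
alpha = 0.05
n_comparisons = 3
bonferroni_threshold = alpha / n_comparisons

# Hypothetical K(r) values from two buffer subsets (illustrative only).
k_curve_a = [0.1, 0.2, 0.3, 0.4, 0.5]
k_curve_b = [0.35, 0.45, 0.55, 0.65, 0.75]
d = ks_statistic(k_curve_a, k_curve_b)

# Table 6 reports every p-value as below 2.2e-16, so this upper bound is enough for the decision.
reported_p_upper_bound = 2.2e-16
all_significant = reported_p_upper_bound < bonferroni_threshold

print("D =", d, "threshold =", bonferroni_threshold, "significant:", all_significant)
```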
Figure 8: Estimated density: 1km buffer Toronto assault cases. This kernel density estimation plot shows the spatial distribution of assault cases that occurred within the 1km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.
Figure 9: Estimated density: 2km buffer Toronto assault cases. This kernel density estimation plot shows the spatial distribution of assault cases that occurred within the 2km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.
Figure 10: Estimated density: 5km buffer Toronto assault cases. This kernel density estimation plot shows the spatial distribution of assault cases that occurred within the 5km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.
Figure 11: Estimated density: all Toronto assault cases. This kernel density estimation plot shows the spatial distribution of all assault cases across Toronto. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.
Overall, the density estimates (Figures 8-11) differ little across the four datasets. In all of them, the highest-density areas are concentrated in downtown Toronto. The smaller buffer subsets (Figures 8-9) show slightly more high-density locations in the downtown area. With the 5km buffer subset and the full dataset, another high-density location (shown as a yellow spot) appears in the south-west corner of Toronto (Figures 10-11).
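For readers unfamiliar with kernel density estimation, the idea behind Figures 8-11 can be sketched as below. The sketch assumes an isotropic Gaussian kernel, a hand-picked bandwidth, and a handful of hypothetical coordinates in kilometres; the kernel and bandwidth selection actually used for the figures are not reproduced here, and `gaussian_kde_2d` is an illustrative name.

```python
import math


def gaussian_kde_2d(points, x, y, bandwidth):
    """Isotropic Gaussian kernel density estimate at location (x, y)."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        sq_dist = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return norm * total


# Hypothetical incident coordinates (km): four close together, one far away.
incidents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
bandwidth_km = 0.5  # assumed value, chosen only for illustration

dense = gaussian_kde_2d(incidents, 0.05, 0.05, bandwidth_km)   # inside the group
sparse = gaussian_kde_2d(incidents, 2.5, 2.5, bandwidth_km)    # between the groups

print("density inside the group:", dense)
print("density between the groups:", sparse)
```

A larger bandwidth smooths the surface and merges nearby hotspots, while a smaller one produces sharper, more fragmented peaks, so the bandwidth choice affects how many separate high-density locations appear.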
We fit several Poisson process models to the full dataset. The model results are summarized below:
| Model | Parameter term | Coefficient estimate | SE of estimate | Confidence interval of estimate | AIC |
|---|---|---|---|---|---|
| Homogeneous model (constant intensity) | Intercept | −10.24358 | 0.006504074 | [−10.25632, −10.23083] | 531575.7 |
| Inhomogeneous model with “within 1km buffer” variable | Intercept | −10.5535641 | 0.008616209 | [−10.5704515, −10.53668] | 526388.6 |
| | within 1km buffer | 0.9757522 | 0.013136862 | [0.9500044, 1.00150] | |
| Inhomogeneous model with “within 2km buffer” variable | Intercept | −10.6621427 | 0.01053683 | [−10.6827945, −10.6414909] | 527691.3 |
| | within 2km buffer | 0.8090311 | 0.01339284 | [0.7827816, 0.8352806] | |
| Inhomogeneous model with “within 5km buffer” variable | Intercept | −11.0634765 | 0.02215667 | [−11.1069028, −11.0200502] | 529296.1 |
| | within 5km buffer | 0.9500268 | 0.02317779 | [0.9045992, 0.9954544] | |
Table 7: Fitted point process model results. This table lists the model description, the coefficient estimate, standard error and confidence interval for each parameter, and the AIC value for each model.
We interpret each intercept estimate as the baseline log intensity and each buffer coefficient as the change in log intensity for locations inside the corresponding buffer. All intercept and buffer coefficient estimates are statistically significant (Z-test), implying that the baseline log intensity is non-zero and that the buffer variables are associated with significant differences in intensity.
The results indicate that proximity to TTC subway routes is significantly associated with the intensity of assault cases in Toronto. The inhomogeneous model with the “within 1km buffer” variable provides the best fit, with the lowest AIC value (526388.6). Its coefficient of 0.976 on the log scale corresponds to an intensity ratio of \(e^{0.976} \approx 2.7\): the estimated intensity of assaults is about 2.7 times higher inside the 1km buffer than outside it. The corresponding ratios are about 2.2 for the 2km buffer and 2.6 for the 5km buffer, each relative to the area outside that buffer, so the effect does not decrease monotonically with buffer size. Both of these models have higher AIC values and therefore fit less well. These findings highlight the spatial variability of assault intensity and suggest that targeted safety interventions should focus on areas close to transit infrastructure, particularly within the 1km radius, to maximize their impact on public safety.
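The conversion from log-scale coefficients to intensity ratios can be checked with a short calculation. The sketch below takes the coefficient estimates, confidence limits and AIC values from Table 7; the dictionary layout and variable names are illustrative assumptions.

```python
import math

# (coefficient, CI lower, CI upper, AIC) for each buffer model, as reported in Table 7.
models = {
    "1 km": (0.9757522, 0.9500044, 1.00150, 526388.6),
    "2 km": (0.8090311, 0.7827816, 0.8352806, 527691.3),
    "5 km": (0.9500268, 0.9045992, 0.9954544, 529296.1),
}
homogeneous_aic = 531575.7

for label, (coef, lower, upper, aic) in models.items():
    ratio = math.exp(coef)  # intensity inside the buffer relative to outside
    ratio_ci = (math.exp(lower), math.exp(upper))
    aic_gain = homogeneous_aic - aic  # improvement over the constant-intensity model
    print(f"{label}: ratio {ratio:.2f}, "
          f"CI [{ratio_ci[0]:.2f}, {ratio_ci[1]:.2f}], AIC gain {aic_gain:.1f}")

# The model with the lowest AIC is preferred.
best_model = min(models, key=lambda name: models[name][3])
print("lowest AIC:", best_model)
```

Exponentiating the confidence limits gives an interval for the ratio itself, which is often easier to communicate than the log-scale interval.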
Figure 12: Clustering of assault incidents within 1 km of TTC subway routes. Each cluster consists of at least 100 data points. 15 clusters are formed.
Figure 13: Clustering of assault incidents within 2 km of TTC subway routes. Each cluster consists of at least 100 data points. 18 clusters are formed.
Figure 14: Clustering of assault incidents within 5 km of TTC subway routes. Each cluster consists of at least 100 data points. 30 clusters are formed.
Figure 15: Clustering of all assault incidents in Toronto. Each cluster consists of at least 100 data points. 33 clusters are formed.
Figures 12-14 show the clustering of assault incidents that occurred within 1 km, 2 km and 5 km of TTC subway routes in Toronto, while Figure 15 shows the clustering of all Toronto assault incidents. Each cluster consists of at least 100 data points. Each colored cluster is a distinct geographical grouping of incidents identified by the spatial clustering method HDBSCAN. The clusters highlight areas with higher concentrations of assaults, suggesting potential hotspots near subway routes. The black points scattered across the map are individual incidents that were not assigned to any cluster and are treated as noise points, while the color-coded regions show patterns of spatial proximity and density.
The 1km and 2km buffer datasets (Figures 12-13) produce clusters (15 and 18 respectively) at similar locations, with the clusters in the 2km buffer slightly larger. As the buffer is increased to 5km and then to the full dataset, the clusters near TTC subway routes identified from the 1km and 2km buffer data are generally retained and grow slightly, which is expected given that larger datasets contain more points. A non-negligible number of additional clusters appear in the 5km buffer data and the full data but not in the 1km or 2km buffer data: the number of clusters almost doubles, to 30 and 33 respectively. These additional clusters are generally smaller than the main clusters near TTC subway routes, with the exception of two notably large clusters in the south-west corner of Toronto.
Overall, the significant clustering patterns found within the 1km and 2km buffers around TTC subway routes persist as the study area expands to the entire Toronto region, implying that TTC subway routes may influence the spatial distribution of assault cases.
The analysis reveals significant spatial clustering of assault cases in Toronto, particularly influenced by proximity to TTC subway routes. Firstly, Kolmogorov-Smirnov tests conclude that the intensity of assault cases depends on whether they occurred within the 1 km, 2 km, and 5 km buffer zones. Quadrat, Ripley’s K-function and G-function tests further reject the null hypothesis of complete spatial randomness (CSR) and confirm the presence of clustering in the assault incident point pattern. Comparison of Ripley’s K-function across the within-buffer subsets reveals that clustering patterns also depend on the buffer size (i.e. proximity to TTC subway routes). Secondly, kernel density estimation shows that downtown Toronto is consistently estimated to have a high density of assault incidents across all within-buffer subsets. Thirdly, Poisson process models incorporating the buffer covariates outperform the homogeneous model, indicating that spatial variation in assault incidents correlates with proximity to subway routes. Lastly, the HDBSCAN clustering analysis revealed significant concentrations of assault incidents near TTC subway routes in Toronto, with distinct clusters identified within the 1 km and 2 km buffers. These clusters retained their general location and grew slightly as the buffer was expanded to 5 km and to the entire city, indicating that the spatial influence of subway routes persists across broader study areas. Larger datasets also revealed additional clusters, particularly in the south-west of Toronto, most of which were smaller than the main clusters near subway routes.
In conclusion, the findings of this analysis indicate that assault incidents in Toronto exhibit a clustering point pattern rather than being randomly distributed across the city. This clustering suggests that certain areas experience disproportionately high levels of such incidents, creating identifiable “hotspots” that could warrant further investigation. The spatial relationship between these clusters and the Toronto Transit Commission (TTC) subway routes suggests that the transit system may play a significant role in influencing the distribution of assault incidents. This influence could stem from several factors, including high population density and increased foot traffic around subway stations, which often serve as hubs of activity. Additionally, the movement of individuals along these transit routes could create opportunities for encounters, both positive and negative, within these high-traffic zones.
Identifying and understanding these hotspots is crucial for developing targeted interventions aimed at reducing assault incidents. Such interventions might include enhancing security measures at or near subway stations, increasing surveillance or police presence, improving lighting and visibility in these areas, or fostering community engagement to address underlying social issues. Overall, this research underscores the importance of integrating spatial analysis into urban planning and public safety strategies. By recognizing and addressing the spatial dynamics of assault incidents, stakeholders can implement more effective and localized solutions to enhance safety and security for all residents and commuters in Toronto.
While this study provides valuable insights into the spatial clustering of assault incidents near TTC subway routes, several limitations must be acknowledged. First, the analysis is constrained to reported assault incidents, which may underrepresent the true number of assaults due to underreporting or data recording discrepancies. Second, the study assumes that proximity to subway routes directly correlates with transit-related activity, but it does not account for other factors that may influence assault occurrences, such as socio-economic conditions, land use, or time-of-day variations. Third, the static nature of the buffer zones does not consider dynamic population movements or fluctuations in transit ridership, which may affect assault patterns. Additionally, the reliance on a single year of data limits the ability to identify long-term trends or temporal variations. Finally, spatial modeling techniques may be influenced by assumptions about the underlying distribution of assault incidents, potentially oversimplifying complex interactions. These limitations highlight the need for further research incorporating additional data sources, temporal analyses, and contextual variables to provide a more comprehensive understanding of crime patterns near urban transit systems.